Introduction

The data used in study is taken from 50hertz.com

data <- lapply(list.files(path ="./Input/50Hertz", full.name=TRUE), fread, skip=4, header=TRUE)
data <- rbindlist(data)
# Drop useless column
data[, V5:=NULL]
# Set column names 
setnames(data, c("Datum", "Von", "bis", "MW"), c("measurement_date", "from", "to", "MW"))
# Adding datetime columns
data[, from:=paste(measurement_date, from)] 
data[, to:=paste(measurement_date, to)] 
# Dropping date column
data[, measurement_date:=NULL]
# Parsing string to DateTime
data$from <- parse_date_time(data$from, "%d.%m.%y %H:%M", tz=getOption("tz"))
data$to <- parse_date_time(data$to, "%d.%m.%y %H:%M", tz=getOption("tz"))
data[format(to, "%H:%M") == "00:00", to:=to + days(1)]

The dataset contains 228547 measurements of MW, collected from 2015 till 2021.

Head of table
from to MW
2015-01-01 00:00:00 2015-01-01 00:15:00 12.078
2015-01-01 00:15:00 2015-01-01 00:30:00 11.852
2015-01-01 00:30:00 2015-01-01 00:45:00 11.808

Choosing one DateTime column for next analysis

Let’s choose to as datetime

Head of table
measurement_date MW
2015-01-01 00:15:00 12.078
2015-01-01 00:30:00 11.852
2015-01-01 00:45:00 11.808

Control Area

50Hertz is a transmission system operator for the north-east of Germany.

Main consumers of 50Hertz in Germany are:

  • Berlin
  • Brandenburg
  • Hamburg
  • Mecklenburg-Western
  • Pomerania
  • Saxony
  • Saxony-Anhalt
  • Thuringia

Control Area of 50hertzMap of transmission system operators in Germany

Choosing area for light/dark time of day and weather data

Let’s choose Berlin as basis for linkage weather and sunrise/sunset data. Because it’s located in the center of the north-eastern part of Germany.

data <- mutate(data,
       is_light = if_else(measurement_date %between% 
                          select(getSunlightTimes(date = as.Date(measurement_date), 
                                 lat = getOption("lat"), 
                                 lon = getOption("lon"), keep=c("sunrise", "sunset"), 
                                 tz=getOption("tz")), sunrise, sunset), TRUE, FALSE)
)
kable(head(data, 3), caption="Head of table")
Head of table
measurement_date MW is_light
2015-01-01 00:15:00 12.078 FALSE
2015-01-01 00:30:00 11.852 FALSE
2015-01-01 00:45:00 11.808 FALSE

Weather data nonnumerical columns exploration

Merge weather data

  • T - Air temperature (degrees Celsius) at 2 metre height above the earth’s surface;
  • P0 - Atmospheric pressure at weather station level (millimeters of mercury);
  • P - Atmospheric pressure reduced to mean sea level (millimeters of mercury);
  • U - Relative humidity (%) at a height of 2 metres above the earth's surface;
  • DD - Mean wind direction (compass points) at a height of 10-12 metres above the earth’s surface over the 10-minute period immediately preceding the observation;
  • Ff - Mean wind speed at a height of 10-12 metres above the earth’s surface over the 10-minute period immediately preceding the observation (meters per second);
  • WW - Special present weather phenomena observed at or near the aerodrome;
  • c - Total cloud cover;
  • VV - Horizontal visibility (km);
  • Td - Dewpoint temperature at a height of 2 metres above the earth’s surface (degrees Celsius);
weather_data[, `:=`(WW=NULL, c=NULL, DD=NULL)]
setkey(data, measurement_date)
setkey(weather_data, m_date)
data <- weather_data[data, roll="nearest"]
setnames(data, c("m_date"), c("measurement_date"))
kable(head(data, 3), caption="Head of table")
Head of table
measurement_date T P0 P U Ff VV Td MW is_light
2015-01-01 00:15:00 5 768.9 773.4 100 5 1.8 5 12.078 FALSE
2015-01-01 00:30:00 5 768.9 773.4 100 5 1.8 5 11.852 FALSE
2015-01-01 00:45:00 5 768.9 773.4 100 5 1.3 5 11.808 FALSE

Data quality

kable(
data %>% 
  summarize_all(
    ~ sum(is.na(.))), 
caption="Missing values")
Missing values
measurement_date T P0 P U Ff VV Td MW is_light
0 2 2 0 2 3 0 2 0 0
data <- data %>% 
  mutate(T = na_locf(T),
         P0 = na_locf(P0),
         U = na_locf(U),
         Ff = na_locf(Ff),
         Td = na_locf(Td),
         VV = na_locf(VV))
data <- data[!are_duplicated(data, index=measurement_date)]

Data exploration

Outliers Y & X

data_tsbl <- as_tsibble(data, key=NULL, index = measurement_date)

data_daily <- data_tsbl %>% 
  index_by(dt = as.Date(measurement_date)) %>% 
  summarise(MW = mean(MW)) %>% 
  fill_gaps() %>% 
  mutate(MW = forecast::na.interp(MW))

data_daily %>%
  model(STL(MW ~ trend(window = 365) + season(period = "year"))) %>%
  components() %>% 
  autoplot() + theme_minimal()

data_anomaly <- data_tsbl %>% 
  time_decompose(MW, frequency = "1 year", trend="1 year") %>% 
  anomalize(remainder, alpha = 0.05, max_anoms = 0.35) %>% 
  time_recompose()

data_anomaly %>% plot_anomaly_decomposition(size_dots=0.2)

data_anomaly <- data_tsbl %>% 
  time_decompose(MW, frequency = "1 year", trend="1 year", method="twitter") %>% 
  anomalize(remainder, alpha = 0.05, max_anoms = 0.35) %>% 
  time_recompose()

data_anomaly %>% plot_anomaly_decomposition(size_dots=0.2)

Normality Y

QQ-plot

Anderson-Darling test

## 
##  Anderson-Darling normality test
## 
## data:  data$MW
## A = 95.625, p-value < 2.2e-16

Результаты теста на нормальность статистически значимы (p-value < \(2.2\cdot10^{-16}\)), таким образом, нулевая гипотеза о нормальности выборки в данном случае отвергается.

Ниже представлены результаты применения теста для различных групп (день/ночь + месяц). Все результаты получились статистически значимыми, то есть принимается альтернативная гипотеза.

Violin plot

Время суток

Так как \(p\)-value > 0.05, то мы не имеем оснований отвергнуть нулевую гипотезу о равенстве средних значений двух групп: ‘темное’ и ‘светлое’ времена суток.

##                 Df Sum Sq Mean Sq F value Pr(>F)
## is_light         1      0   0.035   0.009  0.925
## Residuals   228524 890639   3.897
## 
##  Kruskal-Wallis rank sum test
## 
## data:  MW by is_light
## Kruskal-Wallis chi-squared = 26.057, df = 1, p-value = 3.315e-07
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  data$MW and data$is_light 
## 
##      FALSE  
## TRUE 3.3e-07
## 
## P value adjustment method: holm

Тест Краскела-Уоллиса выявил статистически значимые различия между признаками is_light на MW (\(\chi^2(1)=26.057, p<0.05\)). Post-hoc тест на основе теста Манна-Уитни с корректировкой Холма выявил значимые различия между группами TRUE и FALSE (\(p < 0.05, r = 0.01\)).

Время года
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## season           3 162602   54201   17013 <2e-16 ***
## Residuals   228522 728038       3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Kruskal-Wallis rank sum test
## 
## data:  MW by season
## Kruskal-Wallis chi-squared = 44346, df = 3, p-value < 2.2e-16
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  data$MW and data$season 
## 
##        Winter Spring Summer
## Spring <2e-16 -      -     
## Summer <2e-16 <2e-16 -     
## Fall   <2e-16 <2e-16 <2e-16
## 
## P value adjustment method: holm
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$MW and data$season 
## 
##        Winter Spring Summer
## Spring <2e-16 -      -     
## Summer <2e-16 <2e-16 -     
## Fall   <2e-16 <2e-16 <2e-16
## 
## P value adjustment method: holm

Тест Краскела-Уоллиса выявил значимый эффект признака MW на season (\(\chi^2(3)=44346, p < 0.05\)). Post-hoc тест на основе теста Манна-Уитни с корректировкой Холма выявил значимые различия между группами Winter и Spring (\(p < 0.05, r = 0.38\)), Winter и Summer (\(p < 0.05, r = 0.58\)), Winter и Fall (\(p < 0.05, r = 0.26\)), Spring и Summer (\(p < 0.05, r = 0.21\)), Spring и Fall (\(p < 0.05, r = 0.15\)), Summer и Spring (\(p < 0.05, r = 0.39\)).

День недели
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## weekday          6  59422    9904    2723 <2e-16 ***
## Residuals   228519 831217       4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Kruskal-Wallis rank sum test
## 
## data:  MW by weekday
## Kruskal-Wallis chi-squared = 13971, df = 6, p-value < 2.2e-16
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  data$MW and data$weekday 
## 
##           Monday  Tuesday Wednesday Thursday Friday  Saturday
## Tuesday   < 2e-16 -       -         -        -       -       
## Wednesday < 2e-16 0.0162  -         -        -       -       
## Thursday  < 2e-16 0.5192  0.0032    -        -       -       
## Friday    < 2e-16 1.8e-13 < 2e-16   1.6e-11  -       -       
## Saturday  < 2e-16 < 2e-16 < 2e-16   < 2e-16  < 2e-16 -       
## Sunday    < 2e-16 < 2e-16 < 2e-16   < 2e-16  < 2e-16 < 2e-16 
## 
## P value adjustment method: holm
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$MW and data$weekday 
## 
##           Monday  Tuesday Wednesday Thursday Friday  Saturday
## Tuesday   < 2e-16 -       -         -        -       -       
## Wednesday < 2e-16 0.4891  -         -        -       -       
## Thursday  < 2e-16 0.0103  0.0014    -        -       -       
## Friday    < 2e-16 < 2e-16 < 2e-16   5.2e-15  -       -       
## Saturday  < 2e-16 < 2e-16 < 2e-16   < 2e-16  < 2e-16 -       
## Sunday    < 2e-16 < 2e-16 < 2e-16   < 2e-16  < 2e-16 < 2e-16 
## 
## P value adjustment method: holm

Тест Краскела-Уоллиса выявил значимый эффект признака MW на weekday (\(\chi^2(6)=13971, p < 0.05\)). Post-hoc тест на основе теста Манна-Уитни с корректировкой Холма выявил значимые различия между всеми сезонами, кроме сочетания Tuesday и Thursday (\(p = 0.52, r = 0.01\))

Homogeneity Y

##                 Df Sum Sq Mean Sq F value Pr(>F)
## is_light         1      0   0.035   0.009  0.925
## Residuals   228524 890639   3.897
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## season           3 162602   54201   17013 <2e-16 ***
## Residuals   228522 728038       3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## weekday          6  59422    9904    2723 <2e-16 ***
## Residuals   228519 831217       4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Lockdown

Forecasting

Prophet

Для начала применим Prophet на наши данные в лоб.

Добавим в качестве предиктора температуру и is_light в виде условного режима сезонности, а также праздники Германии.

Видно, что точность модели улучшилась, а также тренд стал линейным нисходящим,попробуем посмотреть поближе на графики температуры.

Видно, что паттерны изменения температуры похожи на график предсказываемого значения MW.

Попробуем добавить в модель данные о локдауне.

Tune

Tune1

Add P0, U, Td, VV as regressors

tune 3

tune 4

tune 5

Hamburg